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Abstract. We discuss a Bayesian model selection approach to high dimensional 
data in the deep under sampling regime. The data is based on a representation 
of the possible discrete states s, as defined by the observer, and it consists of 
M observations of the state. This approach shows that, for a given sample size 
M, not all states observed in the sample can be distinguished. Rather, only 
a partition of the sampled states s can be resolved. Such partition defines an 
emergent classification of the states that becomes finer and finer as the sample 
size increases, through a process of symmetry breaking between states. This allows 
us to distinguish between the resolution of a given representation of the observer 
defined states s, which is given by the entropy of s, and its relevance which is defined 
by the entropy of the partition Qs. Relevance has a non-monotonic dependence 
on resolution, for a given sample size. In addition, we characterise most relevant 
samples and we show that they exhibit power law frequency distributions, generally 
taken as signatures of “criticality”. This suggests that “criticality” reflects the 
relevance of a given representation of the states of a complex system, and does 
not necessarily require a specific mechanism of self-organisation to a critical point. 


1. Introduction 

In the study of complex systems - such as the brain, cells or our economies - we face 
conceptual issues of a novel type, because the systems studied involve many variables, 
many of which are unknown. In addition, their behaviour is not constrained by well 
established laws, as in physics. In such high dimensional inference problems one 
is hardly ever sampling correctly an underlying probability distribution, even with 
huge data sets. In order to evade the deep under-sampling domain, we implicitly 
or explicitly resort to dimensionality reduction schemes, where the data is projected 
into a low-dimensional space where statistics can provide accurate conclusions. Yet, 
in this process, the data processing inequality jT] tells us that we inevitably loose 
relevant information on the system’s “laws of motion”. So understanding which are 
the relevant variables is crucial in order to limit information losses. This requires 
guiding principles for the choice of dimensional reduction schemes, or for measuring 
the relevance of a given set of variables. 

Recently Ref. [2] suggested that the entropy of the frequency of observations (see 
later) can be used as a measure of relevance of a given representation of the data. 
This allows one to characterise most informative samples as those that maximise 
this measure, at a given resolution and for a given sample size. Remarkably, one 
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finds that most informative samples, in the under-sampling regime, have a power 
law frequency distribution j2]. 

This hnding sheds light on the widespread observation of “criticality” (i.e. power 
law frequency/size distributions) in empirical data [Hj ranging from language |1], 
statistics of natural images |5], neural activity Pd, city size distribution |8], to name 
just a few cases. In brief, this strongly suggests that the observed power laws usually 
associated with “criticality” arise as a consequence of our choice of relevant variables 
and that they do not necessarily require hidden mechanisms of self-organisation 
to a critical point [9j. Besides the academic interest of such an interpretation of 
“criticality”, its implication for data analysis are far reaching because the proposed 
measure of relevance can be used as a universal guiding principle in the search of 
optimal dimensional reduction schemes (e.g. data clustering) or for the identihcation 
of relevant variables (e.g. keywords in texts, relevant amino acids in proteins) [2]. 

The purpose of this paper is to ground the hnding of Ref. [2] in a model selection 
Bayesian framework, thereby clarifying its information theoretic basis. In brief, 
within this approach, we shall see statistical models of the data emerge from a process 
of symmetry breaking between data points in the samplef, acquiring more and more 
details as the size of the sample increases. In this way, model selection informs us on 
what resolution in the space of outcomes is justihed by the data. In order for different 
outcomes to be assigned different probabilities, the frequency with which they occur 
in the sample must be sufficiently different. Formally, this identihes an optimal 
partition which distinguishes outcomes that occur with different probabilities. The 
entropy of the size of the partitions provides a measure of the number of outcomes 
that can be distinguished in the sample (or of the number of parameters that can be 
estimated from the samples) and hence a measure of relevance. In what follows, for 
the sake of simplicity, we shall dehne and refer to this measure as relevance. 

The next section introduces the generic problem we deal with and discusses 
model selection. Simple examples are presented to provide the main intuition. We 
shall hrst show that, barring atypical cases, an upper bound to the relevance is given 
by partitions in frequency classes. Next we shall see that most informative samples 
are characterised by power law frequency distributions. This will be followed by an 
application to two different examples of real data sets. The results suggest that the 

f In what follows, a sample is a sequence of data points, each of which belong to a set of possible 
outcomes, which are defined a priori. 
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entropy of the frequency as suggested in Ref. |2|, can be used in place of the entropy 
of the optimal partition, which is computationally more demanding, as a measure of 
relevance. A hnal discussion will close the paper. 


2. The problem 

Let s = be a dataset of M observations of the state s of a system. 

Here is a discrete variable, that we can think of as the label of the cluster to 
which the observation belongs, or the conhguration s = (si,..., s„) of a system 
of n discrete degrees of freedom (e.g. the amino acid sequence of a protein domain). 
The number of possible different states s is much larger than M and it may even be 
unknown. We restrict attention to the case where can be thought of as outcomes 
of independent experiments, carried out in the same conditions. 

The general question of interest is to infer the laws governing the system, from 
the data. This can be formalised by assuming that the data can be thought of as M 
i.i.d. draws from a generative model where the function ps should 

encode the property of the system and the functions it performs. The basic problem 
then becomes that of inferring the generative model. 


2.1. Resolution and relevance 


Reference [2] has shown that, if we think of each sample as a realisation of an 
optimisation problem of a function U{s,s) over an enlarged set of variables that 
includes also unknown variables (s), then the frequency 


M 


kg — 'y ^ 


i=l 

with which a given observation s occurs in the sample provides a noisy estimate 
of that part Ug = of the function that is being optimised. Hence the 

relevance of the particular choice of the variables s, among all those that enter P, 
is reflected in the statistics of the frequency kg of states s. Ref. j2] argues that a 
quantitative measure of relevance^ in information theoretic terms, is given by 

k 


M 


M 


( 1 ) 


rrik = 




kg mk 


where 
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is the number of states that occur k times in the sample s. Notice that H[K] is the 
entropy of the random variable Ki = kgO) for a randomly chosen point of the 
sample. This is different from the entropy of the state s itself | 


J/|S] 




( 2 ) 


Intuitively, this measures the resolution of the description based on the variable s. 
Indeed a more detailed dehnition of the state s of the system likely results in a higher 
resolution (i.e. a larger value of) H[S] but not necessarily in a higher relevance H[K]. 


2.2. Learning the generative model 

Given a generative model AA = pg, the likelihood of s is dehned as: 

M M 

p(s\M) ==nrf’- =E'*■..<•>• p) 

i=l s i=l 

The frequentist approach estimates the best model as the one that maximises the 
likelihood. This results in equating probabilities with frequencies: pg = kg/M. 
The Bayesian approach, instead, invokes Bayes rule to turn the likelihood into a 
(posterior) distribution over the parameters p of the model. This requires identifying 
a prior distribution Po(p) fhat reflects our ignorance on p before seeing the data. 
A minimal requirement is that Po{fP> should be a symmetric function of the p^’s. 
Dirichelet priors 

are a mathematically convenient choice, and ignorance requires by symmetry that 
Og = a is independent of s. The posterior is easily computed: 

( 5 ) 

This allows us to give a Bayesian estimate of the probabilities 

{Ps)i = j dppgPlip) = (6) 

f Again we use uppercase for random variables defined on the space of the points in the sample s. 
Also we assume maximum likelihood estimates of the probability PIS' = s} = kg/M. 
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where S is the number of states. When M ^ aS this converges to the frequentist 
estimate kg/M, reminding us that in the presence of a large enough data set, the 
choice of the prior does not matter. 

There are a number of problematic issues with this procedure: 

(i) The set of possible states and their number S should be known in advance. This 
is not always the case. 

(ii) The estimate of the entropy H[S] = — J2s(Ps logPs)i is strongly affected by the 
prior and it converges slowly to its true value, as shown in Ref. US]. 

(iii) The model assumes a different parameter for each state that occurs in the 
sample. A posteriori, this assumption is not justified as there is nothing that 
can be learned from the data on how the probabilities of two states that are 
seen the same number of times differ. Indeed, the posterior estimate of these 
probability depends on the frequency kg and is exactly the same for two states 
s, s' that occur the same number of times kg = kg/. 

In particular, the last point suggests that we are in a clear case of over-fitting 
and indeed this model does not survive a model selection test, as we shall see in what 
follows. 

3. Model selection 

The key issue is that the definition of states s is made by the observer, not by the 
system. If the distinction between s and s' is totally spurious, we expect that the 
data will not distinguish between the two states, i.e. kg ^ kg/. Conversely, if two 
states are seen the same number of times, there is no reason to assume that they 
have a different probability. In terms of inference, we are not allowed to think that 
Pg 7^ pg/ unless we have sufficient evidence. 

3.1. An illustrative case: two states 

Let there be only two states s = 0,1 and assume there are M observations, k = ki 
with s = 1 and M — k with s = 0. There are two possibilities: one is that 
the two states are actually the same, i.e. that the underlying distribution has 
Po = Pi = 1/2, the other that the states are different, i.e. pi = p = 1 — po. 
These correspond to different models that we can identify with different partitions 
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of states and the associated probabilities. So the hrst case corresponds to a model 
A4o = [({0,1}, 1/2)] where the two states are symmetric, whereas the second to 
a model Mi = [({0}, 1 — p), ({l},p)]. Clearly P{s|Alo} = 2“^ whereas for Mi 
the likelihood P{s|Afi} can be obtained by integrating the likelihood over the prior 
distribntion of the parameter p, for which again we take a Dirichelet form. Hence 

r(2a)r(/c + a)r(M — k + a) 


P{s\Mi} = 


(7) 


r(a)2r(M + 2a) 

In order to compare the two models, we invoke Bayes rnle and compnte the posterior 
probability 

P{s\Mi)PoiM,) P{s\M,)Po{M,) 


P{M,\s) = 


E:Pis\M,)Po{M, 


Pis) 


where Po{Mi) is the prior probability of model i. For the sake of simplicity, we’re 
going to assnme that all models are a priori eqnally likely§. So the most probable 
model is the one with the highest likelihood P{s|Al}. In the present case, it is easy 
to check that, for M S> 1, in the representative case of a nniform prior (a = 1) we 
have that as long as 


k 1 ^ /log(2M/7r) 

M ~ 2 ^ V 


the symmetric model Mq shonld be preferred. 

Fignrej^ shows an extension for the 3-states case. Here the possible models are 
Mq with no parameters (each state has p = 1/3), Mi^i where two ont of the three 
states have the same probability (p* = p and Ps = (1 — p)/2 for z = 1, 2 or 3), and 
M 2 where all states have a different probability. If the freqnencies are close enongh, 
the states shonld not be distingnished and the model with no parameters shonld be 
preferred (bine snrface in . Conversely the red snrface reflect the cases where two 
states shonld not be distingnished from each other, and the green shows the case 
were the three states shonld be distingnished. 


3.2. The general case 

The argnment above snggests that, in the general case, for each pair of states s and 
s' their probability shonld be the same, nnless they occnr in the data a snfficiently 

§ By Occam’s razor, one would be tempted to prefer simpler models, i.e. those with fewer 
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150 150 

Figure 1. Model selection in a three state system with M = 150 observations. 
ki, k 2 and fca are the number of observations of each state. The coloured 
surface shows the preferred model in terms of the likelihood P{s\M.i). A4o is 
the model with no parameters {pi = 1/3, Vz), Ali is the one with one parameter 
{pi — p, pj = pk = (1 — p)/2), and AI 2 is the one with two parameters 
{pi = p, Pj = q,pk = (1 - (p + q))). 


different number of times. If kg kg' instead, they should be assigned the same 
probability, i.e. the symmetry between states s and s' should not be broken. 

Conversely, imagine the situation where the distinction between states s and s' 
is completely arbitrary, with no relation with the internal states of the system under 
study. Complete ignorance of the system about the distinction between states s and 
s' means that the probability distribution restricted to only these two states must 
be the one of maximal entropy, i.e. that pg = pg/. 

We remind again that the dehnition of states s is made by the observer, not by 
the system. If it distinguishes effectively different internal states of the system, then 
this definition is relevant and meaningful, otherwise it is not. One way to turn this 
observation into a quantitative criterium is to extend the model selection argument 
above. 

Given the set S of states s that are seen (with multiplicity kg > 0), then a 

parameters. Yet Occam’s razor already arises from the integration over the parameters implied by 
Bayes rule, without the need to introduce it ad hoc. 
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generic model M. = [Q, p\ is one where different states are divided into a partition 

N 

Q= {Q11Q2, ■ ■ ■ iQn)i U Qg = ‘5 

q=l 

of a nnmber N of disjoint sets, and each state in the subset of the partition 
(s G Qq) has the same probability|| p,q. If = \Qq\ is the number of states in subset 
Qq, then /ig satisfies the normalisation 

^mg/ig=l. ( 8 ) 

<? 

Any possible partition corresponds to a different model, going from the one 
where each state is in the same subset (s G Qi,Vs), to the one where each state is 
in a different subset (s G Qs,\/s). It is possible to consider more general structure 
that also includes yet not seen states (i.e. states with kg = 0). We shall see below 
that these are less likely than those considered here. Each partition Q identifies a 
different model Ai. This is why we shall use the partition Q to refer to the model 
that is based on that partition. 

It is straightforward to compute the likelihood of each model: 


F{S|Q} = K, = ^ i 

Q S^Qq 


where the prior contains the constraint Eq. 

(Dirichelet) priors 


(9) 


8|. We take again conjugate 


d®’M = r(aiV)p[ 


m 


r(a 




~ ^ 

\q&Q 


where N is the number of partitions in Q, i.e of parameters in Q. Then 

logP{S|S} = 5] 


, T(K, + a) ,, , 

log - - Kq\ogmq 


r(a) 


The posterior distribution, under model Q is 

Kq-\-a 


( 10 ) 


-iogb^^(n) 




+ a) 




Kq+a—l^ 


T{aN) 


( 12 ) 


II All quantities N, mq, Qq fiq depend on the model At. We omit this dependence for the sake of 
simplifying formulas. 
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The expected value of Ps for s E Qg 

/ I r\\ ^ ^ 

{Ps\Q)i = 


IS 


mg M + aN ’ 


Vs G Qq 


(13) 


where (... \ Q)i indicates expected values over the posterior distribution Eq. (12). 
The expected value of the entropy iif [S'] = — fogPs is given by 






^ M + aN 
q 


[ip{Kg + a + 1) - logm, - ^jJ{M + aN + 1)] (14) 


where 'f>{z) = is the digamma function. 

Assuming that all models are a priori equally likely, P{s| Q} is also proportional 
to the posterior probability P{Q|s} of model Q given the data. Therefore the optimal 
model is given by^ 

Q* = argmaxP{s|Q}. (15) 

The partition Q* identihes an emergent description of the system in terms of 
effective states g, that we shall call g-states. This is the statistical description that 
can be resolved on the basis of the dataset s. The states s G Q* in the same 
partition g cannot be distinguished one from the other, hence they all correspond 
to the same g-state. The variable g is associated to a distribution pg = mgpig, 
which is the probability to observe the g-state. The entropy of this distribution 
^[Q] — ~ 'YhqPq fogPg provides a quantitative measure of the amount of information 
that the data provides on the generative model. It’s expected value on the posterior 


distribution Eq. (12) 


(alOllS), = - ^ 


A' 


M + aA^ 


[f^{Kg a 1) - '0(M aN 1)] (16) 




Kr, 


M + aN 


log rUq 


(17) 


is what we shall call relevance. Indeed, this is a measure of the relevance of the 
original description based on the states s. Eq. (17) shows that {H[Q]\Q)^ < 
(77 [S'] Iwith equality if and only if all partitions Qg contain only one state 
{mg = 1 Vg). The next section illustrates the behaviour of this measure in some 
specihc examples. Before doing that, it is instructive to discuss the issue of unsampled 
states and two special cases, to make contact with the results of Ref. j2]. 


^ A fully Bayesian approach would entail considering all possible partitions Q with their probability 
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3.3. Unsampled states 


In many instances, the sample contains only a partial coverage of the set of possible 
states. There are two ways in which not yet sampled states could be included in one 
of the partitions Q discussed above. Either adding them to one or more of the sets 
Qq or augmenting the partition with a set Qq that includes all states with kg = 0. In 
the hrst case, the partition Q changes into one which is identical on all sets Qqi with 
q' ^ q and with Qq —)■ Q'^ = Qq[JQo, where Qq is the set of unseen states. Since 
kg = 0 for s G Qo, the count Kq does not change, and the change in the likelihood is 
given by —Kq log(l + mo/mg), where mo = IQol is the number of states s G Qo- Since 
the change in the likelihood is negative, the optimal partition Q* does not include 
not yet sampled states. 

The change in the likelihood when the unseen states are added to the partition 
in a new set, Q —)■ Q+o = (Q, Qo) can also be easily computed. The hrst two terms 
in Eq. 01 do not change, as Kq = 0, so the only difference is due to the fact that 
the number of sets increases by one: N —)■ iV + 1. Hence the change in the likelihood 


log 


^{s|Q+o} 


= - log 


r(M + aiV + a)r(aiV) 


(18) 


P{s\Q} ^T{M + aN)T{aN + a) 

is again negative. Hence models based on partitions that include unseen states are 
dominated by those discussed above, if they are considered equally likely a priori. 

Yet, if one expects that the sample contains only a partial coverage of the set of 
possible states, the uniform prior hypothesis needs to be revised. Therefore 


log 


P{Q+o|g} 

P{Q\s} 


— An 


M-1 

E 

fc =0 


log 1 + 


k + aN 


(19) 


where Aq = log encodes the a priori likelihood that states s that are not 

present in the sample s exist. Notice that the second term in Eq. (19) increases with 
M (as a log(l + M/{aN)) for M, iV 3> 1). Hence for a given Aq, we expect the model 
Q to become preferable to Q+o as M grows large. When instead the model Q+o is 
the optimal, this approach also gives an estimate of the discovery probability 

a 


Po = ... ... (20) 

M + aN + a 

which is an intense subject of research in statistical learning^, since the work of Good 


n{Q|s}. Here we depart from this approach and focus on the most likely partition. 

+ This discussion relates to the wider field of non-parametric Bayesian statistics which discusses 
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and Turing |12j . 


3 . 4 . Special cases 


For the model based on the atomic partition S, where each subset contains one state 
Qs = {s} 


logP{s|5} = E log 


T{ks + a) , r(M + aNs) 


k 


rrik log 


r(a) 
r(fc + a) 


log 


r(a) 


log 


r(aiV,) 

T{M + aNs) 

r(aiV,) 


( 21 ) 


( 22 ) 


where Ng = |5| is the number of different states s that occur in the sample s. Note 
that ms = 1 and Kg = kg is simply the frequency of state s. 

For the model based on the frequency partition /C, where subset Qk = {s '■ kg = 
/c} for fc = 1, 2,..we have = km^ and 


logP{s|/C} = 5]] log 


V{kmk + a) 

kmu 


log 


T{M + aNk) 

r(aiVfc) 


(23) 


■p' / \ ^ 

k r(a)mfc 

where Nk = |/C| is the number of different values of kg that appear in the sample. 

Naively one would expect that P{s|/C} > P{s|iS}, i.e. that the JC partition 
should always be preferred to the atomic partition S. Appendix C proofs that this is 
indeed the case for a = 1 and for a —)■ 0. But it also exhibit counterexamples where 
this is not so, in the limit of large a. These however correspond to rather atypical 
samples and no counterexample to the rule P{s|/C} > P{s|iS} has been found in 
the data we have analysed. This strongly suggests that, in practical terms, the /C 
partition should always be preferred to the S partition. 


4. Properties of the optimal partition Q* 

Finding the optimal partition Q* for a given sample s is a non-trivial task. It is 
reasonable to assume that partitions that merge states with adjacent frequencies are 

models that reproduce sampling processes. For a general introduction, the reader is referred to m- 
A model of the sampling process based on our approach departs from this literature in that non- 
parametric Bayesian models such as the Dirichelet’s process are based on a single partition (5 in 
this case) whereas we consider selecting the optimal partition for each M. Further discussion of this 
issue would bring us too far from the main aim of the present paper and will be dealt with elsewhere. 
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more likely than those that merge states with non-adjacent frequencies*. Therefore, 
it is enough to consider partitions where all states s E Qq have frequency kg which 
is larger than that of all states s' G Qg/ with q > q'. This leads us to the following 
heuristics to derive the optimal partition Q* of a hnite sample: 


(i) Starting from Q = )C: 

(ii) For every q = 1,, Nq — 1, dehne a new partition by merging the subsets 

Qq and Qq+i of the current partition Q and compute the change in the log- 
likelihood. 

(iii) If the largest increase in the log likelihood over all possible values of q is positive, 
then merge the corresponding subsets, update the partition Q accordingly and 
repeat the previous step. 

(iv) If the largest increase in the log likelihood over all possible values of q is negative, 
then return Q* as the optimal partition. 


In order to explore the properties of Q* we study ensembles where the states s 
are drawn from power law distributions P{s) ~ s“". This choice serves for generating 
data with a broad distribution of frequencies, such as those that are often observed 
in empirical studies. Varying a allows us to probe the merging algorithm proposed 
over a broad range of underlying distributions. 

Figure]^ gives a pictorial representation of the merging process during a typical 
run. Interestingly, visual inspection suggests that the frequencies of the optimal 
model Q* are evenly spaced in a logarithmic scale. 

One can think as well of variations to the algorithm such as selecting a favourable 
move at random in step (ii) instead of choosing the one that maximizes the likelihood, 
or merging triplets of subsets instead of pairs. We have seen that the overlaps in 
the hnal representations obtained using these variations in the algorithm are always 
larger than 90%. Moreover we see that for large samples the probability of hnding 
a representation with H[Q] greater than H[Q*] goes to zero, meaning that the later 
yields a more relevant description of the data. This issue is discussed in Appendix 

El 


* If ks^ > ks2 > ksg then a partition Q where si,S 3 G Qq, and S 2 G Qq^ will be dominated by 
partitions where either all three states are in different sets, or si, S 2 G Q', and S 3 G Q ',, or si S Q', 

yi H2 

and S 2 , S 3 G Q'^i , or they are all in the same set. 
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Figure 2. Illustration of the Merging Process. M = 10^ data points were drawn 
from a distribution P{s) ^ s““, with a = 1.2. The x-axis shows the estimated 
probability (131 for states in each subset Q G Q. The y-axis stands for the different 
partitions Q in the merging process from /C to Q*. 


4-1. Distance between Q* and /C and scaling with the sample size 


Figure shows the difference between Q* and /C as a function of the sample size 
M. Panel A shows the estimated parameters (ps) (Eq. 13) for both models and 
two sample sizes Mi = 10^ and M 2 = 10®. The states with higher frequency kg are 
not merged, so the partitions S, K and Q* overlap on the left tail of the curve on 
a number f of identical subsets of states. We estimated the Q* partition and the 
parameters Ps using priors with a ranging from 0.01 to 10. The different overlapping 
curves in panel A stand for the different values of a. Clearly neither the number 
of subsets in Q* {Nq*) nor the estimated parameters (ps) vary strongly with a. In 
the following analysis we set a = 1. Panel B shows that the overlap ^ between the 
two partitions scales with M with a non-trivial exponent ( 7 ) which depends on the 
underlying distribution parametrized by a (panel C). The number of parameters [N) 
in each partition gives a measure of the overfitting done in /C with respect to Q*. 
Panel D shows that N ~ M^ has a power law dependence on M with an exponent 8 
that depends on ajj (panel E). The exponent 5 for the Q* partition is smaller than 


U For the JC partition it is possible to show that 5 = 1/(1 +a). The argument relies on the fact that 
the frequency of state s approximates the probability kg/M ~ ^ s~°‘ as long as 1 is large 
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that of the /C partition implying that the difference between Nq* and Njq increases 
with M. 

An interesting observation is that the optimal partition Q* provides an estimate 
of the entropy of the underlying distribution that converges faster than that based 
on the S partition. The slow convergence of the entropy based on the S partition 
and its strong dependence on the prior where noticed in Ref. m, that also proposed 
a remedy based on treating a as a hyper-parameter in Bayesian inference. Figure 
1^ shows that the estimate based on the optimal partition Q* converges faster than 
finer representations, and that Bayesian inference and model selection are enough to 
have a reliable estimate of the entropy. This also suggests that the information kept 
in the coarser representation is truly relevant for characterising the sample, while the 
discarded information is noise associated with the finite number of data points. 


5. Criticality of maximally informative partitions 


Having provided a measure for the relevance of a given sample, allows one to 
characterise the typical properties of most relevant samples, i.e. of samples that 
are maximally informative. This question was partly addressed in Ref. j2], where 
an upper bound to the entropy H[K], for a given sample size M and at a given 
resolution H[S], was derived. Interestingly, this exercise shows that the distributions 
that achieve the upper bound in the under-sampling regime, are power laws, i.e. 

~ This suggests that “criticality”, i.e. the observation of scale-free 

frequency distribution, may be a consequence of choosing the most informative 
variables, and need not necessarily imply underlying mechanisms of self-organisation 
to a critical point. 

In Appendix A we revisit the argument leading to the upper bound and 
also derive a lower bound for H[K], showing that this is also achieved when the 
distribution of frequencies has a power law behaviour ~ 

The observation (see Fig. that model selection identifies partitions Q* with 
posterior probabilities {ps\Q*)i that are evenly spaced on a logarithmic scale, suggests 
that the same may be true for samples of a given size M, with a maximal {H[Q\ \ Q,*)i 


enough. We note that mu — ds/dk is the number of states s in an interval of frequency dk = 1, 
hence ~ 3°"^^ jM ~ The number of states corresponds to the value of k such 

that TOfe becomes of order one. Therefore ~ Interestingly, we also find that 7 = 5/2 

for the /C partition, to numerical precision. These relations do not hold for the Q partition. 
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Figure 3. Scaling of the optimal partition with the sample size M. Panel A 
shows the estimated parameters < ps > for each subset in partitions K, and Q*. 
The data was drawn from a power law distribution P(s) ~ with a = 1.2. For 
both partitions we show the estimated parameters for a sample of size Mi = 10^ 
and M 2 = 10 ®. ^ denotes the number of parameters which are identical under 
both models /C and Q*. N/c and Nq* are the number of parameters (subsets) in 
each model. The different overlapping red (black) curves correspond to estimations 
using different values for the prior parameter a, ranging from 0.01 to 10. Panels 
B-E show analysis using a = 1. Panel B shows the scaling of ^ with the sample 
size, for a = 1.2, which follows a power law ^ ~ Panel C shows the 

dependence of 7 with a. Panel D shows the scaling of the number of parameters 
in each model with the sample size, which follows a power law N ^ Panel 

E shows that the number of parameters in Q* grows slower with M than in K, for 
a wide range of systems (a). 
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Figure 4. Convergence of the estimated entropy under different models. 
Labels S, 1C, Q* stand for the bayesian estimates of the entropy (141 in the 
respective model using a flat prior (a = 1). ML stand for the maximum 
likelihood estimate (§• The dashed line is the true entropy of the underlying 
distribution P{s) ~ Error bars denote standard errors over 1000 samples 

of each size M. The inset shows the difference between the bayesian estimates 
based on the posterior distribution 


14[ ) and the likelihood of the model (111 
AiL[5] = ^ [(iL[iS]|Q)^ — (—]^ log(P{s|Q}))] where H* is the true entropy of 
the distribution. 


at a given resolution (i7[S']|Q*)i. 

Yet, in order to further corroborate this conclusion, one needs to resort 
to numerical simulations. To this end, we generated samples from Montecarlo 
simulations maximising the measures of relevance proposed above. The simulations 
consisted in the following steps: 

(i) Start with an arbitrary sample dehned by the frequencies k = (fci,..., /cat), with 
kg = M, and N the initial number of states. Without loss of generality sort 
the frequencies in decreasing order ki > ... > k^- 
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(ii) Consider every possible move of n samples from state i into state j ^ i for 
all i G [1, N],j G [1, iV + 1] and n, under the constraint that ki — n > ki+i and 
kj-i > kj + n. Notice that j = iV +1 implies defining a new state with frequency 
n. Conversely ii ki = 1 this state will disappear after moving the one sample to 
state j. 

(hi) Choose the move which maximizes the Lagrange function C = H[Q] + fiH[S], 
with Q = K and Q = Q* independently, for a fixed value of jj,. 

(iv) Repeat (ii)-(iii) until there is no favourable move. 


Keeping fv fixed in step (iii) and allowing H[S] to fluctuate accordingly during 
the simulation favoured the ergodicity of the dynamics with respect to hxing H[S] 
and maximizing H[Q]. We repeated the simulations for different values of /i G [—1,5]. 
For each value of /i we repeated the simulations with different initial conditions. This 
was not essential for the maximization of H[K] but for the maximization H[Q*] the 
process converged to local maxima strongly dependent on the initial conditions. We 
therefore varied the initial number of states N from 25 to 950, for a sample of size 
M = 1000, and performed 20 independent realizations for each initial resolution. The 
absolute maximum of H[Q*] + fiH[S] across realizations was kept for each p,. Panel 
A in figureshows the results for both relevance measures H[K] and H[Q*]. Values 
of p < — 1 yield the trivial result of H[Q\ = 0 and H[S] =0 which corresponds to 
the solution ki = M, kj^i = 0. p = —1 yields solutions with H[Q] = H[S] in the 
well sampled regime (left part of the diagram). In the case of Q = K, the solutions 
are of the form ~ k~^~^ (see Appendix A). Panel B shows the solutions obtained 
for /i = 1 which match the expected Zipf law. The dashed curves in panel A refer to 
theoretical upper and lower bounds for the value of H[K] (Appendix A). 


6. Application to real data 

In this section we compare the models based on the /C and Q* partitions in two 
applications to real data. The /C partition is derived directly and exactly from 
the data whereas the Q* partition requires a calculation that may be heavy and 
approximate. The scope of this section is to show that in practical cases, the /C 
partition is a very good approximation to the optimal one Q*. Intuitively, the reason 
why this is so relies on the fact that informative samples (those with a large H[Q*] 
or H[K]) have broad frequency distributions, and as we have seen, the Q* and /C 
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Figure 5. Maximally relevant samples. Samples of size M=1000 were generated 
via Monte Carlo simulations maximizing H\K] + yH\S] and H[Q*] + yH[S] (see 
main text), for fixed values of /i S [—1,5] which determine the resolution H[S] of the 
solutions. All entropies are normalised by log(M). Panel A shows the maximized 
7J[Ar](red) and iJ[(5*](blue) together with theoretical bounds for H[K] (dashed). 
The solutions expected are of the form ^ Panel B shows the solutions 

obtained for y = 1, which are power laws of exponent 2 (Zipf’s law). The dashed 
line is an approximate solution to the theoretical lower bound for H[K] where 
is assumed to be a poisson variable (see Appendix AI. 


partitions have a sizeable overlap in these cases. 

In the hrst example, we analyse a hnancial market data set of stock retnrns. The 
data set (nsed previonsly in [131113) span a period from 1st Jannary 1990 to 30th 
April 1999 (2249 time points) and it covers the M = 2000 most freqnently traded 
stocks in the New York Stock Exchange in that period. Assnming that retnrns 
are ganssian with a block diagonal correlation matrix allows one to gronp stocks in 
clnsters of “sectors”, by maximnm likelihood (see mills] for details). The clnster 
label Si of each stock z = 1 ,..., M identihes the S partition in this context. As the 
nnmber Ng of clnsters varies from 1 to M, the algorithm prodnces partitions S with a 
different resolntion H[S]. We compare the relevance of different levels of description 
by compnting H[K] and H[Q*]. Here the optimal partition Q* is obtained with 
the algorithm dehned in Section starting from 1C. Panel A in £gnre[^ shows both 
measnres of relevance as a fnnction of the resolntion H[S]. The dashed curves are 
theoretical upper and lower bounds to the estimate of the maximal value of H[K], 
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given H[S] and M (see Appendix A). Panel B illustrates the relation between the 
/C and Q* representations, at the resolution marked by the vertical dashed line in 
panel A. Bars in panel B denote the /C partition, whereas the colours indicate which 
frequencies were merged together to form the coarser optimal model Q*. Panels C 
and D provide a closer look at the distance between partitions /C and Q*. Panel C 
shows the overlap between the /C and Q* partitions at each resolution. The overlap 
was computed by the Adjusted Rand Index (ca), which is bounded above by 1 
and yields 0 when the overlap matches the one expected by chance. For illustrative 
purposes we show the overlap between shuffled versions of the partitions (red curve), 
which indeed yield constant zero for all resolutions. We point out that in the strongly 
under-sampled domain, both models are practically the same. Panel D shows the 
estimated parameters (Eq. 13) in the /C and Q* models, at the resolution marked by 
the vertical line in A. The dashed line is a Zipf law for comparison. 

As a second example, following Ref. [2], we consider the problem of identifying 
relevant positions in the sequence of amino acid that correspond to a particular 
protein domain. In brief, the data consists of Multiple Sequence Alignments (MSA) 
of M sequences of the same protein domain, across different species. We refer to 
[TB] for a detailed description, for our purposes here, suffice it to say that a protein 
domain can be identified by a sequence a = (oi,..., a^) of L amino acids, each being 
of one of 21 possible types (e.g. a, = R for valine, ai = A for alanine, etc) and that 
an MSA is a collection of M such sequences across different organisms or species. 
The key point is that, while the whole sequence is subject to a random process of 
mutations, there are features which need to be conserved in order to perform the 
function the protein is supposed to do. In order to understand which positions along 
the sequence are relevant for the biological function, we observe that each subset 
I C {1, 2,..., L} of the positions identifies a partition Si of the MSA data, whose 
elements s = {ai,i G I) are the subsequences of the domain on the positions i E I. 
From this we can define the /C and the optimal Q* partitions, and compute both the 
resolution Hi[S] and the relevance Hj[Q*] or Hi[K] corresponding to this subset of 
positions. This makes it possible to look for the most relevant subset of positions I*, 
as the one that maximises Hj[Q*]. This program is carried out in Ref. [16] to which 
we refer the interested reader. Here we confine the discussion to the comparison of 
the /C and Q* partitions. In brief, the maximisation (of either /C or Q*) is done 
using a Montecarlo algorithm for subsequences of a fixed number n of amino acids. 
We applied the algorithm to the Voltage Sensor Domain of ion channels (Pfam code 




-Q* Overlap 
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Figure 6. /C and Q* representations of a financial market data set of stock returns. 
Stocks where clustered at different resolutions (If [5]) by means of the algorithm 
developed in Ref. m- The relevance of each level of description is quantified by 
H[K] and H[Q*] in Panel A (all entropies are normalized by log(M)). Panel C 
shows the overlap between partitions JC and Q* at each resolution. The overlap 
index is the Adjusted Rand Index (ARI, Ref. [15]) which yields 1 for identical 
partitions and 0 when the overlap is expected by chance. The partition labels 
where shuffled before computing the ARI (red curve) to illustrate this case. Panels 
B and D refer to partitions JC and Q* of the data, at the resolution marked by the 
vertical dashed line in A (i?)!?] 0.75). Bars in Panel B show the 1C partition. The 

x-axis are the frequencies and the y-axis are the number of states seen with each 
frequency. The colours show the coarser Q* partition, obtained by the algorithm 
of section Panel D shows the estimated parameters with the model based on JC 
and Q*. The dashed line is a Zipf law for comparison. 
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PF000520). The data was the same as that used in Klein et al. [T7]. In brief, the 
algorithm of Ref. US] produces a distribution over the subsets I* of n relevant sites 
that allows one to compute the probability that a site is either selected or not in two 
different realisations of I*. Fig. [^shows that optimising the relevance of the /C or the 
Q partitions provides a sharp separation between relevant and irrelevant positions, 
which is sharper for the /C partition. In addition, the selected subsets of sites I'f and 
Jg* under the optimisation of H[K] or H[Q*] have a large relative overlap; in 90% 
of the cases, the two optimisation schemes yield the same prediction on whether a 
site that is relevant or not (see Fig. [^and the caption for details). 

These two examples suggest that, in practical applications, H[K] can be used 
as a proxy for H[Q*\ in measuring the relevance. This is particularly useful to avoid 
the optimisation leading from /C to Q* and speed up numerical calculations. 

7. Conclusion 

The Big Data revolution has made available data of unprecedented detail on the 
working of complex systems, such as cells, networks of neurons and the brain, 
ecologies, social networks, economies and financial markets. This, in particular, 
indicates that quantitative approaches typical of hard sciences can be extended to 
life sciences as well. Yet, the fact that such phenomena are not constrained by well 
known laws, as in physics, makes inference of behaviour a daunting task. Indeed, one 
is rarely in the circumstance where behaviour depends on only few variables that can 
be controlled. In such cases, the resolution of high dimensional data, is not given by 
the number of variables that one can measure simultaneously, but rather is limited 
by statistical errors induced by finite sample size. Dimensionality reduction schemes 
have to be invoked to adjust the resolution so that reliable statistical information can 
be extracted from the data. This inevitably introduces a tradeoff between relevance 
and resolution, which is addressed in this paper. 

The main contribution of this paper is to make this tradeoff explicit and 
quantitative in information theoretic terms, on the basis of a Bayesian model selection 
approach. We focus on the limiting case where the system under study is severely 
under-sampled and no other information apart the frequency of observations is 
available. There, models are in one to one correspondence with partition of the set 
of observed states. So while resolution is a measure of the number of different states, 
relevance can be defined in terms of the number of different elements in the partition. 


Criticality of mostly informative samples: A Bayesian model selection approach 23 



Figure 7. Self-overlap Otc and Oq- (filled circles) of the subset of site If and /q, 
produced by the optimisation of II\K\ and H[Q*], respectively, on the MSA for 
the voltage sensor protein domain. Self-overlaps are computed as the probability 
that a randomly chosen site is either selected or not in two different runs of the 
algorithm of Ref. m and are shown, in the plot, as a function of the number n of 
sites in I*. The overlap 0{K, Q) between the subsets If and Iq, is also computed 
in the same way, and it is shown with filled squares. The dashed line corresponds 
to the overlap between subsets of n sites chosen randomly among the L = 114 
possible sites. 


i.e. the number of different states that the data allows one to distinguish. We find 
that, as resolution increases from the coarser possible level, relevance increases up 
to a maximum, beyond which it starts decreasing. In the extreme limit where each 
observation is seen only once, relevance vanishes, signalling that data contains no 
relevant information on the system. 

The resolution (i.e. the number of sets in the partition) also provides a natural 
cutoff in the number of parameters that the data allows us to infer, beyond which 
inference would result in overfitting. The number of parameters (and of partitions) 
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increases with the sample size M. Loosely speaking, as M increases, the model 
passes throngh a seqnence of symmetry breaking transitions where more and more 
distinctions between states can be made. This process, indeed, bears well known 
formal analogies with the symmetry breaking process in physical systems when the 
temperature (here proportional to 1/M) decreases. 

There are several interesting directions for further research along these lines. 
One is to extend the approach in Section 3.3 to explore sampling processes na 
that are consistent with the Bayesian model selection scheme. The second is to 
exploit these results for inference of graphical models in cases where states can be 
considered as a conhguration s = (cxi,..., cr^) of an extended system. There, a 
well established technique is Boltzmann learning (see e.g. [7]) which, given a set of 
relevant observables, invokes maximum entropy principle and predicts a distribution 
P((Ji,..., an). The set of relevant observables determines the model. Yet, no general 
criterium exists for dictating what relevant observables should be and it seems natural 
to invoke model selection schemes to address the issue. 

Finally, the present approach also suggests a new perspective on the widespread 
occurrence of criticality. It suggests that the occurrence of broad frequency 
distributions is a consequence of sampling relevant variables in the under sampling 
regime. In this spirit the interesting question is not whether or why "biological 
systems are poised at criticality" jTH] but rather how to use the "apparent criticality" 
of frequency distributions to select relevant variables. 
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Appendix A. Samples that maximise H[K] have power law distribution 

The problem is to hnd the distributions G N that satisfy 
kmk = M, H[s] = 


E kruk , k 

-TT log Tit = Hq 


M " M 


and maximize 


£Vrrm \^kmk, kmk 


M ° M 


(A.l) 

(A.2) 
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The problem is difficult because it has to be solved for integer In order to 
circumvent this problem we think of as being drawn from a distribution and 


maximise the expected value of H[K], subject to the constraints that the expected 
value of H[s] and M = Yhk kmk are hxed. The main technical problem relies in 


computing the expected value of logm^. On one side, one can observe that 


E [mfc log mk] > Uk log Uk = E[mk ]. 

This makes it possible to derive an upper bound on the maximal value of H[K]. 


Indeed, one particular distribution of is one where mk = Uk for all k, with integer 
rifc. The maximisation over these distributions coincides with the original problem. 
Maximising 



over all real Uk > 0 with J2k^'^k = M and /cnfc log(fc/M) = —MHq, clearly 
produces an upper bound to the true solution. This upper bound, as discussed in |2] 
predicts power law distributions k ^ ^ with /x > 1. 

In order to derive a lower bound, we confine ourselves to a specihc class of 
distributions. More precisely, we take as Poisson variables with mean Uk and 
solve the problem of finding Uk such that the average of H[K] is maximised under 
the same constraints as above. Notice that this is akin to studying the problem in 
the analog Gran Canonical Ensemble where M is a allowed to fluctuate. What we 
need to check a posteriori is that the fluctuations of M are small compared to the 
mean. 

The only nontrivial part of the calculation has to do with computing the 
expected value of logm^, for which we use the formula 


so that, for a Poisson variable m with mean n, we find 



(A.3) 



(A.4) 


(A.5) 


(A.6) 
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The first expression can be used to check that 


E[m\ogm] ~ an^ + O(n^), 


a = — dz- 


lo log(l - z) 

for n 1, whereas the last shows that E[m log m] ~ nlogn for n 3> 1. 
Writing E[E] = E we hnd 




kuk 

M 


where 


C{n) = / dz- 


h^irik) + (p + 1) log ^ - A 


- 1 


PlHq — AM 


(A. 7) 


(A.8) 


/o log(l - z) 

Notice that the only problematic thing here is that we are not taking into account 
that M also is a random variable. Operationally, one could think at taking an 
ensemble of systems, all strictly satisfying the constraint. Then we define the 
ensemble average Uk of the and pretend that its distribution be Poisson, which 
seems reasonable. 

The extrema of E can now be computed: Uk will satisfy 


k 

UkCfuk) = A - (/r + 1) log — - C{nk) 


(A.9) 


that can be solved numerically foe each k. 

Notice that P{mk > 0} = 1 — therefore the expected number N of states 
s visited is 




Af = ^(l 

k 

In order to compute /cmax notice that 
P{k 


max < ?} = n 
k=q 


= 0| = e 


Finally, the variance of M is given by 


V{M) = k^V{mk) = 


rik 


(A.IO) 


(A.ll) 


(A.12) 


and the validity of the method relies on the fact that 

, V(M) 
hm = 0 

M—>-oo 


(A,13) 
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A rather crude approximation of the solution is possible if we take 
nkC\nk)+C{nk) ^ log(l+nfc/nc) = -(/x+l) log 

he 

for k < kc and = 0 for k > kc- With Uc = 0.38 the approximation is valid to less 
than 1% for n > 10 but it underestimates by 80% the true vale at small n (a larger 
value of no would give a best £t to the small n region). 

Within this approximation 

^ (A.15) 


(A.16) 

(A.17) 

(A.18) 


nk = nc\Y- 
kr 


and it is consistent to take 
C{n) ; 


1 + -) log 1 + -) -1 
n / V Ur 


Therefore 


H\s] 


srAkuk k 


k=l 


H\K] « + £ 


k=l 


kuk 

M 


m 


1 + — 
UkJ 


1 log 


Uk 

1 + — 
Ur 


Appendix B. Properties of C{n) 
For small n; 


C{n) = dz 


e-nz _ I 


(B.l) 


/o log(l - z) 

. /ox 1 . 2 1 , 3 1 . /4096\ 4 5 , 

~ log(2)n-log - n + - log — n-log - n + 0(n ) 

^ 2 ^\3j 6 ^\27) 24 ^ V3645y ^ ^ 


We can write 


7“ 7” dz 

C{n) = / dxe Ai{l—x/n) = logn+ / — 

./n ./n ^ L 


_ -2/n-n(l-e-^/") 


(B.2) 


where h(a;) = i?i(loga;) is the logarithmic integral function. 
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Appendix C. Comparison between the /C and the S partitions 


The partition /C is clearly preferable to S in the limit a —?• 0, as the likelihood ratio 
behaves as . We first argue that this is also the case for a = 1 (uniform prior) 

and then we analyse the opposite limit a —)■ cxd. 

Consider the /C partition of size N for a = 1. Suppose that there are m states 
that occur with frequency k, being therefore in the same subset in /C. Consider now 
a new partition Q in which we have atomised one of the m states to a new subset of 
size 1. We will show that the likelihood of the Q model is smaller than the one of /C 


pm} 

F{s|Q} 


(C.l) 


for a = 1. 


Using Eq. (11), equation (C.l) takes the form 


P{s\)C} 


P{s\Q} 
f{k,m) = 


= f{k,m)g{M,N) 
{km\) 1 


1 

1 - 

m 


fc(m—1) 


(C.2) 


(C.3) 


{k{m — l))!fc! 

where g{N,M) = is an increasing function of M and it decreases with N. So 
the worst case scenario is when M is small and N is large. This corresponds to an 
original K, partition with N — 1 subsets of size = 1 and k = 1, 2,..., iV — 1, plus 
the one subset of size m and frequency k = N from which we are atomising one 
state. This yields the smallest value of M, compatible with k, m and A, which is 

N{N -1) 


M* = km + 


(C.4) 


This gives g{N,M*) = —T—h The minimal value of g is now obtained for 

N* = \/2km, which implies that 

g{M, N) > g{M\ N*) = V2k^ + 

On inspection, it is easy to check that f{k,m) ■ g{M*,N*) is an increasing function 
of m, so it attains its minimum value at m = 2. Therefore 


P{s\]C} 


> 


2 1 
TT 2\fKk 


> 


= 1.128... > 1. 


(C.5) 


P{s\Q} 2Vvr/c 

Notice that the worst case limit of m = 2 is attained when the Q partition becomes 
exactly S. 
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Yet, in the limit of large a, the ratio of the likelihood may become less than one. 
In order to address this issue, we shall exhibit a specihc case for a —)■ cx). 

Let us split the log-likelihood ratio in three pieces: 


A (a) = log 


p{m 

P{5|5} 


log 


log 


T{kmk + mua)/Vimko) 

[r{k + a)/r(a)m|]”^'= 

T{kmk + mka)/T{mka) 


(C. 6 ) 


T{kmk -h a)/r(a) 


log 


r(M + aiV,)/r(aiV, 


T{M + aNk)/T{aNk) 

Writing A = Ai -|- A 2 -|- A 3 , that correspond to the three lines above, using Stirling’s 
approximation, it is easy to show that 


Ai ~ 




2a 


k:mk>l 

A 2 ~ - y^kmk log rrifc 






2a 


N 

A 3 ^ Mlog ^ - {N-^ - N:^)— + 0{a-^) 
The leading order term can be cast in the form 


A = M 


\og N,-H[S] 


M 


\ogNk-H[K] 


(C.7) 

(C. 8 ) 

(C.9) 

(C.IO) 


The first is the amount of information, in nats, that one gains from the knowledge 
of ps = ks/M (over the uniform distribution on s) whereas the second is the amount 
of information one gains from the knowledge of pk = kmk/M (over the uniform 
distribution on k). It seems intuitive that the first is larger then the second. 

Yet it is easy to find counterexamples: Take a sample with M = mk -|- k^ 
points, m states occur ks = k times and one occurs ko times, therefore Ng = m + 1 
and Nk = 2. Then pk = 1/(1 -f x) and p^o = x/{l + x), with x = ko/{mk) and 
Ps = {k/M,..., k/M, ko/M). Then 

^ ^ 1 TTi “1“ 1 1 

//[P] -H[K] = log m, A = log —-- log m 

1 + X 2 1 + X 


Then A < 0 for 


ko < mk 


log[2m/(m -I- 1)] 
log[(m -h l)/2] 
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Figure Dl. Relative relevance of the optimal partition Q* with respect to 
partitions Qg and Qg (see main text). 


For m = 2 this occurs for ko < 1.419 ■ k, for m = 3 k^ < 1.755 ■ k and for m = 10 
ko < 3.507-/c. These, however seem rather pathological samples that will not typically 
arise in a sampling process. 

Appendix D. Variations in the algorithm for defining Q* 

To check on the robustness of the algorithm presented in section]^ we compared the 
Q* partition with the solutions obtained via two variations of the algorithm. The 
first variation consists on choosing the pair of adjacent subsets to be merged in step 
(ii) at random, and accept the move if the likelihood increases. We name this solution 
Q2- The second variation consists in merging triplets of adjacent subsets, selected 
at random and accepting the move if the likelihood increases. We call this solution 
Q3. We draw 50 samples of size M from a distribution P{s) ~ s“", with a = 1 and 
compute the models Q2 and Q3 1000 times for each sample. Figure m shows the 
probability of finding a partition Q2 (Q3) with larger entropy than Q*. We see that 
in both cases this probability goes to zero for large sample sizes, meaning that the 
Q* partition is more relevant in that limit. We also computed the overlap between 
Q*, Q2 and Q3 finding overlaps (measured by the Adjusted Rand Index) over 90% 
for a wide range of parameters (a G [0.5, 3], M G [10^, 10®]). 
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